April 11, 2019

Slides

Overview

  • My career path
  • Data science as fulfilling work
  • How to get a job in data science
  • Appendix
    • Installing R & RStudio
    • Resources to learn R
    • Publicly available data

My Career Path

Education

  • Luther College '12
    • B.A.
    • Major: Psychology
    • Minor: Music
    • Research: Forgiveness and Health
  • Iowa State University '15
    • M.S. & Certificate in Quantitative Psychology
    • Major: Psychology
    • Research: Police Interrogation Tactics and False Confessions; Quantitative Methods

Job History

  • D.M. Services, Inc. (2015 - Current)
    • Customer Satisfaction Data Analyst
    • Data Scientist

Getting from A to B

  • Major interests
    • Working with data
    • Programming
  • Skills
    • Research methods
    • Statistics and modeling
    • Design of experiments
    • Diverse research background
    • Communicating with non-technical audiences

Common Educational Tracks for Data Science

  • Statistics
  • Computer Science
  • Economics
  • Math
  • Business
  • Psychology
  • Physics
  • Engineering
  • "…or other related quantitative field of study"

Data Science as Fulfilling Work

What Do You Do?

  • Data wrangling
  • Feature engineering
  • Statistical modeling
  • Report creation
  • Make empirical predictions
  • Design experiments
  • Employee interviewing and onboarding
  • Professional development
  • Code development
  • Auditing

What Do You Actually Do?

  • Help the company make good decisions (with data)

How Do You Do That?

Typical Questions

  • How likely is it that a customer will respond to a marketing campaign?
  • Given a response, will the customer pay their bill?
  • Is the customer actually who they say they are?
  • What products should we market to each customer?
  • What are the strongest predictors of customer satisfaction?
  • What factors contribute to call duration and call silence?
  • What is the best way to measure phone agent quality?

Why Data Science?

  • Fast-paced work
  • Always working on something new
  • Access to data on a massive scale
  • Large impact on business
  • Consult with other data scientists on their work
  • Complex problems require high attention to detail and technical skill
  • An engaging field of work that demands lifelong learning

Behavioral Statistics vs. Machine Learning

Behavioral Statistics

  • Interpretability
  • Theoretical
  • Common analytic tools
    • lm()
    • glm()
    • aov()
    • t.test()
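
As a minimal sketch of the four tools listed above, the calls below fit each one to R's built-in `mtcars` data. The dataset and the specific formulas are assumptions chosen for illustration and are not part of the original talk.

```r
# Linear regression: does car weight predict fuel economy?
fit_lm <- lm(mpg ~ wt, data = mtcars)

# Logistic regression: does weight predict a manual transmission (am = 1)?
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# One-way ANOVA: does mean mpg differ by number of cylinders?
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)

# Welch two-sample t-test: mpg for automatic (am = 0) vs. manual (am = 1)
fit_t <- t.test(mpg ~ am, data = mtcars)

# Each object can be inspected for interpretable estimates
coef(fit_lm)
summary(fit_aov)
```

In keeping with the emphasis on interpretability, each fitted object exposes coefficients or test statistics that can be read directly, for example with `summary(fit_glm)`.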

Machine Learning

  • Prediction
  • Applied
  • Common analytic tools
    • nnet::nnet()
    • xgboost::xgboost()
    • sparklyr::ml_random_forest()
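
A rough sketch of the prediction-first workflow, using only `nnet::nnet()` from the list above on R's built-in `iris` data. The dataset, the 45-row holdout, and the `size` and `maxit` settings are illustrative assumptions, not choices from the original talk.

```r
library(nnet)  # recommended package, normally installed alongside R

set.seed(20190411)

# Hold out 45 of the 150 rows to estimate out-of-sample accuracy
test_rows <- sample(nrow(iris), size = 45)
train <- iris[-test_rows, ]
test <- iris[test_rows, ]

# Single-hidden-layer network; size and maxit are illustrative settings
fit_nn <- nnet(Species ~ ., data = train, size = 3, maxit = 200,
               trace = FALSE)

# Score the model on rows it has not seen
pred <- predict(fit_nn, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
accuracy
```

The model is judged here by holdout accuracy rather than by inspecting its weights, which is the contrast with the behavioral statistics tools.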

Exciting Topics in Data Science

  • Machine learning
  • Deep learning
  • Optimization
  • Automation
  • Recommendation engines
  • Natural language processing

Getting a Job

Soft Skills

  • Curiosity
  • Creativity
  • Competency
  • Communication

Hard Skills

  • Coding/BI
    • SQL
    • R
    • Python
    • Shell
    • Excel
    • Git/GitHub
  • Statistics
    • Predictive modeling
    • Machine learning

How to Develop Technical Skills

FiveThirtyEight: The Riddler

How many games would we expect to be needed to complete a best-of-seven series if each team has a 50 percent chance of winning each individual game?

How about if one team has a 60 percent chance of winning each game?

How about 70?

Riddle
Answer

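# Simulate one best-of-seven series and return the number of games played.
# a_pr is the probability that team A wins any single game.
riddler <- function(a_pr = 0.5) {
  a_wins <- 0
  b_wins <- 0

  # Play until either team reaches four wins
  while (a_wins < 4 && b_wins < 4) {
    game <- sample(c("A", "B"),
                   size = 1,
                   prob = c(a_pr, 1 - a_pr))

    if (game == "A") {
      a_wins <- a_wins + 1
    } else {
      b_wins <- b_wins + 1
    }
  }

  # One-row data frame so results can be row-bound by map_dfr()
  data.frame(a_pr = a_pr,
             n_games = a_wins + b_wins)
}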

set.seed(20190411)

riddler()
##   a_pr n_games
## 1  0.5       6

rep(c(0.5, 0.6, 0.7), each = 2)
## [1] 0.5 0.5 0.6 0.6 0.7 0.7

library(tidyverse)

set.seed(20190411)

map_dfr(rep(c(0.5, 0.6, 0.7), each = 10000),
        riddler) %>% 
  head()
##   a_pr n_games
## 1  0.5       6
## 2  0.5       5
## 3  0.5       6
## 4  0.5       6
## 5  0.5       5
## 6  0.5       7

set.seed(20190411)

map_dfr(rep(c(0.5, 0.6, 0.7), each = 10000),
        riddler) %>% 
  group_by(a_pr) %>% 
  summarize(avg_games = mean(n_games),
            n = n())
## # A tibble: 3 x 3
##    a_pr avg_games     n
##   <dbl>     <dbl> <int>
## 1   0.5      5.82 10000
## 2   0.6      5.70 10000
## 3   0.7      5.38 10000
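
One way to audit the simulated averages above is to compute the expectation exactly. The sketch below is not part of the original slides, and `expected_games()` is a hypothetical helper name. It relies on the fact that a series ends in exactly n games when the eventual winner takes game n and exactly three of the previous n - 1 games.

```r
# Exact expected length of a best-of-seven series.
# a_pr is the probability that team A wins any single game.
expected_games <- function(a_pr = 0.5) {
  b_pr <- 1 - a_pr
  n_games <- 4:7

  # P(series ends in exactly n games), summed over either team winning
  p_end <- choose(n_games - 1, 3) *
    (a_pr^4 * b_pr^(n_games - 4) + b_pr^4 * a_pr^(n_games - 4))

  sum(n_games * p_end)
}

sapply(c(0.5, 0.6, 0.7), expected_games)
```

The exact values are approximately 5.81, 5.70, and 5.38, which agree with the simulated averages to within sampling error.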

A Great Riddle!

  • Creating user functions
  • while() loops
  • if() logic
  • Diverse data structures
  • Simulation
  • Iteration
  • Descriptive statistics
  • Auditing code and results
  • Easy to extend and visualize

library(plotly)

set.seed(20190411)

# this simulation will take longer to run
map_dfr(rep(seq(0.50, 1.00, 0.01), each = 10000),
        riddler) %>% 
  group_by(a_pr) %>% 
  summarize(avg_games = mean(n_games),
            n = n()) %>% 
  plot_ly(x = ~a_pr, y = ~avg_games, type = "scatter", mode = "lines",
          text = ~avg_games, hoverinfo = "x+text") %>% 
  layout(title = paste0("Expected Number of Games in Best-of-Seven Series",
                        "<br>Given Varying Win Probability"),
         xaxis = list(title = "Team A Win Probability"),
         yaxis = list(title = "Games", range = c(0, 7)),
         hovermode = "compare")

Ways to Stay Up-to-Date

Advice to Get Hired

  • Internship
  • Graduate school
  • Complete a research project outside of class
    • Faculty research
    • Web scraping
    • Publicly available data (see Appendix)

Contact

Colony Brands, Inc.

  • Analytic opportunities
    • Credit risk
    • Marketing
    • Retail
    • Product
    • Fraud
    • Customer experience
  • Paid summer internship
  • Full-time analyst position

To apply or learn more visit www.colonybrands.com

R You Ready? Taking Small Steps Toward Becoming an R Statistics User

  • Talk
    • Friday, April 12 (TOMORROW)
    • 4:00 pm
    • Valders 350
  • Overview
    • Getting Started
    • Essential R: 10 Packages for the New and Intermediate UseR
    • Resources for Future Learning

Contact


justin.marschall@imsdm.com
jcmarschall
justinmarschall
justinmarschall

Appendix

Installing R & RStudio

Resources to Learn R

More Resources to Learn R

Publicly Available Data

Licenses